by Yi (Michelle) Deng
========================================================
This report explores a white wine dataset containing quality evaluations and attributes for 4898 white wines.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
This dataset consists of 12 variables (plus the row index column X), with 4898 observations.
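The difference between the 13 columns reported by `dim()` and the 12 variables discussed here comes from the index column `X`. A minimal sketch of dropping it, using only the first five rows printed in the `str()` output above (the name `wine_head` is made up for this illustration, and this is not the full dataset):

```r
# First five rows, copied from the str() output above (rounded as printed).
wine_head <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.4, 0.4),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

# Drop the row index so that only the 12 wine variables remain.
wine_head$X <- NULL
dim(wine_head)
```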
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
## [1] 20
## [1] 5
The quality scores range from 3 to 9, and most wines score between 5 and 7. The quality distribution appears roughly normal, with a peak at 6. The 20 worst wines are scored 3, and only the 5 best wines are scored 9.
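Counts like the two printed above can be obtained by tabulating the scores. The sketch below uses a small made-up vector (`toy_quality` is hypothetical, not the wine data) just to show the calls:

```r
# Toy scores for illustration only; the real column is wine$quality.
toy_quality <- c(3, 5, 5, 6, 6, 6, 6, 7, 7, 9)

# Frequency of every score.
score_counts <- table(toy_quality)
score_counts

# Number of wines at the lowest and highest score.
n_worst <- sum(toy_quality == min(toy_quality))
n_best  <- sum(toy_quality == max(toy_quality))
```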
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
There are a few wines with extremely high fixed.acidity. After omitting the top 0.1% of values, the distribution of fixed.acidity appears normal, with a peak around 6.75.
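Omitting the top 0.1% of values amounts to filtering on the 0.999 quantile, the same pattern that appears in the model calls later in this report. A minimal sketch on simulated numbers (the vector `x` is made up; it only mimics a roughly normal variable with one extreme value):

```r
set.seed(42)

# Simulated stand-in for a variable such as fixed.acidity, plus one outlier.
x <- c(rnorm(2000, mean = 6.8, sd = 0.8), 14.2)

# Keep everything at or below the 99.9th percentile.
cutoff  <- quantile(x, 0.999)
trimmed <- x[x <= cutoff]

# The extreme value is gone, and only a handful of points were dropped.
c(original_max = max(x), trimmed_max = max(trimmed))
length(x) - length(trimmed)
```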
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
There are a few wines with extremely high volatile.acidity. After omitting the top 1% of values, the distribution of volatile.acidity appears normal, with a peak around 0.25.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
There are a few wines with extremely high citric.acid. After omitting the top 1% of values, the distribution of citric.acid appears normal, with a peak around 0.3. There is also a second sharp peak at 0.49, with more than 200 wines, all of which are scored 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
I transformed the long-tailed data to better understand the distribution of residual.sugar. The transformed residual.sugar distribution appears bimodal, with one peak around 1 and another around 8.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
I transformed the long-tailed data to better understand the distribution of chlorides. The transformed chlorides distribution appears normal, with a peak around 0.05.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
There are a few wines with extremely high free.sulfur.dioxide. After omitting the top 1% values, the distribution of free.sulfur.dioxide appears normal, with the peak around 30.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
After omitting the top 0.1% values, the distribution of total.sulfur.dioxide appears normal, with the peak around 125.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
After omitting the top 0.1% values, the distribution of density appears normal, with the peak around 0.992 to 0.995.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The distribution of pH appears normal, with the peak around 3.15.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The distribution of sulphates appears normal, with the peak around 0.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
There are 4898 white wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality).
Most white wines are scored 6. Twenty worst wines are scored 3, and only 5 best wines are scored 9.
The median fixed.acidity is 6.80, ranging from 3.80 to 14.20. The median volatile.acidity is 0.26, ranging from 0.08 to 1.10. The median citric.acid is 0.32, ranging from 0 to 1.66. The median residual.sugar is 5.20, ranging from 0.60 to 65.80. The median chlorides is 0.043, ranging from 0.009 to 0.346. The median free.sulfur.dioxide is 34.00, ranging from 2.00 to 289.00. The median total.sulfur.dioxide is 134.00, ranging from 9.00 to 440.00. The median density is 0.994, ranging from 0.987 to 1.039. The median pH is 3.18, ranging from 2.72 to 3.82. The median sulphates is 0.47, ranging from 0.22 to 1.08. The median alcohol is 10.40, ranging from 8.00 to 14.20.
The main feature in the data set is quality. The quality rating is the evaluation outcome for each white wine. I'd like to determine which features are best for predicting the quality of a white wine. Since all other features are continuous variables, it is hard to say at this point which one is the better candidate.
All 11 other features are likely to contribute to the quality of a white wine. They will be examined further in the following bivariate and multivariate analyses.
No.
I log-transformed the right-skewed residual.sugar and chlorides distributions.
The transformed distribution for residual.sugar appears bimodal, with residual.sugar peaking around 1 and again around 8. There are few white wines with log10(residual.sugar) around 0.5.
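To relate the log10 scale back to the original units, the sketch below applies `log10()` to the residual.sugar summary values reported earlier and back-transforms the 0.5 mentioned above. The vector name `sugar_summary` is introduced here for illustration only.

```r
# Five-number summary of residual.sugar, copied from the summary output above.
sugar_summary <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# On the log10 scale the long right tail is compressed.
sugar_log10 <- log10(sugar_summary)
round(sugar_log10, 2)

# Back-transform: log10(residual.sugar) = 0.5 is a residual.sugar of about 3.2.
dip_in_original_units <- 10^0.5
dip_in_original_units
```

On this reading, the sparse region at 0.5 sits between the two peaks near 1 and 8 in the original units.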
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
Quality tends to correlate positively with alcohol and negatively with density. The total.sulfur.dioxide, residual.sugar, density and alcohol features tend to correlate with each other: the higher the alcohol, the lower the density, the residual.sugar and the total.sulfur.dioxide. The total.sulfur.dioxide also tends to correlate positively with the free.sulfur.dioxide. The pH tends to correlate negatively with the fixed.acidity.
From a subset of the data, only alcohol and density seem to correlate moderately with quality. However, since other features such as total.sulfur.dioxide, residual.sugar, density and alcohol tend to correlate with each other, I would like to take a closer look at scatter plots of these inter-correlated features.
It is hard to see the relationship between alcohol and quality from the scatter plot. Therefore, I converted quality into an ordered factor. The relationship appears to be nonlinear, with a drop at score 5.
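One way to see which features stand out is to rank the correlations with quality by absolute value. The sketch below copies the quality column of the correlation matrix above (rounded to four decimals); the names `quality_cor` and `ranked` are introduced here for illustration.

```r
# Pearson correlations with quality, taken from the matrix above.
quality_cor <- c(
  fixed.acidity        = -0.1137,
  volatile.acidity     = -0.1947,
  citric.acid          = -0.0092,
  residual.sugar       = -0.0976,
  chlorides            = -0.2099,
  free.sulfur.dioxide  =  0.0082,
  total.sulfur.dioxide = -0.1747,
  density              = -0.3071,
  pH                   =  0.0994,
  sulphates            =  0.0537,
  alcohol              =  0.4356
)

# Sort by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
head(ranked, 3)
```

Alcohol and density come out on top, followed by chlorides.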
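A minimal sketch of the ordered-factor step, assuming a small made-up data frame (`toy` and the column name `quality.f` are hypothetical, not taken from the original code):

```r
# Toy data for illustration only.
toy <- data.frame(
  alcohol = c(9.0, 9.4, 10.1, 9.6, 10.8, 11.2, 12.0, 12.5),
  quality = c(4L, 5L, 5L, 5L, 6L, 6L, 7L, 8L)
)

# Treat the integer score as an ordered categorical variable.
toy$quality.f <- factor(toy$quality, ordered = TRUE)

# Median alcohol within each quality level, which is what a boxplot
# of alcohol by quality.f would draw as the centre line of each box.
median_by_score <- tapply(toy$alcohol, toy$quality.f, median)
median_by_score
```

With quality as a factor, per-level summaries and boxplots replace the overplotted scatter of a discrete score against a continuous variable.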
##
## Call:
## lm(formula = quality ~ alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.582009 0.098008 26.34 <2e-16 ***
## alcohol 0.313469 0.009258 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
Despite the fact that the relationship looks nonlinear, based on the R^2 value, alcohol explains about 19 percent of the variance in quality.
##
## Call:
## lm(formula = quality ~ density, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1441 -0.6258 0.0005 0.5162 4.2102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.277 4.003 24.05 <2e-16 ***
## density -90.942 4.027 -22.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8429 on 4896 degrees of freedom
## Multiple R-squared: 0.09432, Adjusted R-squared: 0.09414
## F-statistic: 509.9 on 1 and 4896 DF, p-value: < 2.2e-16
The relationship appears to be nonlinear. However, based on the R^2 value, density explains about 9 percent of the variance in quality.
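For a regression with a single predictor, R^2 is the square of the Pearson correlation, so the two R^2 values above can be checked directly against the correlation matrix. A short sketch using the coefficients printed there:

```r
# Correlations with quality, copied from the correlation matrix above.
r_alcohol <-  0.435574715
r_density <- -0.307123313

# Squared correlations should match the Multiple R-squared of each lm() fit.
r_squared <- c(alcohol = r_alcohol^2, density = r_density^2)
round(r_squared, 4)
```

The results agree with the reported 0.1897 for alcohol and 0.09432 for density.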
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$residual.sugar
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3776791 0.4246712
## sample estimates:
## cor
## 0.4014393
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5094349 0.5497297
## sample estimates:
## cor
## 0.5298813
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$alcohol
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4709775 -0.4262443
## sample estimates:
## cor
## -0.4488921
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
##
## Pearson's product-moment correlation
##
## data: wine$pH and wine$fixed.acidity
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4485154 -0.4026542
## sample estimates:
## cor
## -0.4258583
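The t statistics in these tests follow from the correlation and the degrees of freedom through t = r * sqrt(df / (1 - r^2)). As a check, the sketch below recomputes the first test (total.sulfur.dioxide and residual.sugar) from the values printed above; the helper name `cor_t` is made up for this example.

```r
# t statistic of a Pearson correlation test, given r and its degrees of freedom.
cor_t <- function(r, df) {
  r * sqrt(df / (1 - r^2))
}

# Values from the first cor.test output above.
t_sugar_so2 <- cor_t(r = 0.4014393, df = 4896)
round(t_sugar_so2, 3)
```

With nearly 5000 observations even modest correlations give very large t values, so the p-values say little about the strength of each relationship; the coefficients themselves are more informative.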
After excluding the top 0.1% values of each feature, these scatter plots indicate inter-correlations among chemical properties, which may together influence the quality of white wines.
There is no strong direct correlation between quality and the other features. Alcohol tends to correlate positively with quality, with a moderate correlation coefficient (r = 0.436). Density tends to correlate negatively with quality, with a correlation coefficient of -0.307.
Based on the R^2 value, alcohol explains about 19 percent of the variance in quality, while density explains about 9 percent of the variance. Other features of interest can be incorporated into the model to explain other variance in the quality.
The total.sulfur.dioxide, residual.sugar, density and alcohol features are inter-correlated. The higher the alcohol, the lower the density, the residual.sugar and the total.sulfur.dioxide.
The total.sulfur.dioxide also correlates positively with the free.sulfur.dioxide. The higher the total.sulfur.dioxide, the higher the free.sulfur.dioxide, which makes sense.
The pH correlates negatively with the fixed.acidity. The lower the pH, the higher the fixed.acidity, which makes sense.
The residual.sugar is strongly and positively correlated with the density (r = 0.839). The density is strongly and negatively correlated with the alcohol (r = -0.780).
Levels of quality cluster by alcohol and density values. In general, higher quality scores are located at the top left, with higher alcohol values and lower density values.
When adding quality to the residual.sugar vs. density relationship, I notice that at a fixed density value, a higher residual.sugar value is associated with a higher quality score.
Levels of quality cluster by alcohol and residual.sugar values. In general, higher quality scores are located at the bottom right, with higher alcohol values and lower residual.sugar values.
Quality does not correlate with pH and fixed.acidity. Nothing particularly stands out.
A linear model using those variables may be useful to predict the quality of a white wine.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(wine, alcohol <=
## quantile(wine$alcohol, 0.999)))
## m2: lm(formula = quality ~ alcohol + density, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity,
## data = subset(wine, alcohol <= quantile(wine$alcohol, 0.999)))
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides, data = subset(wine, alcohol <= quantile(wine$alcohol,
## 0.999)))
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide, data = subset(wine, alcohol <=
## quantile(wine$alcohol, 0.999)))
## m7: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
## m8: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + sulphates,
## data = subset(wine, alcohol <= quantile(wine$alcohol, 0.999)))
## m9: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + sulphates +
## pH, data = subset(wine, alcohol <= quantile(wine$alcohol,
## 0.999)))
## m10: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + sulphates +
## pH + citric.acid, data = subset(wine, alcohol <= quantile(wine$alcohol,
## 0.999)))
## m11: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide + fixed.acidity + sulphates +
## pH + free.sulfur.dioxide, data = subset(wine, alcohol <=
## quantile(wine$alcohol, 0.999)))
##
## ==============================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.510*** 90.296*** 74.262*** 73.304*** 81.387*** 60.364*** 86.100*** 162.880*** 163.355*** 150.001***
## (0.098) (6.169) (12.377) (11.979) (12.001) (12.248) (14.113) (14.728) (18.574) (18.612) (18.765)
## alcohol 0.313*** 0.360*** 0.246*** 0.286*** 0.282*** 0.283*** 0.305*** 0.274*** 0.183*** 0.182*** 0.194***
## (0.009) (0.015) (0.018) (0.018) (0.018) (0.018) (0.019) (0.020) (0.024) (0.024) (0.024)
## density 24.746*** -87.870*** -71.580*** -70.543*** -78.816*** -57.521*** -83.546*** -163.031*** -163.516*** -150.085***
## (6.083) (12.320) (11.925) (11.951) (12.211) (14.126) (14.756) (18.844) (18.884) (19.034)
## residual.sugar 0.053*** 0.052*** 0.052*** 0.053*** 0.045*** 0.056*** 0.087*** 0.087*** 0.081***
## (0.005) (0.005) (0.005) (0.005) (0.005) (0.006) (0.007) (0.007) (0.008)
## volatile.acidity -2.062*** -2.047*** -2.080*** -2.096*** -2.049*** -1.969*** -1.960*** -1.870***
## (0.109) (0.110) (0.110) (0.110) (0.110) (0.110) (0.112) (0.112)
## chlorides -0.696 -0.773 -0.861 -0.790 -0.156 -0.180 -0.236
## (0.540) (0.540) (0.540) (0.538) (0.544) (0.547) (0.543)
## total.sulfur.dioxide 0.001** 0.001** 0.001* 0.001* 0.001* -0.000
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## fixed.acidity -0.045** -0.029 0.067** 0.066** 0.066**
## (0.015) (0.015) (0.021) (0.021) (0.021)
## sulphates 0.593*** 0.637*** 0.635*** 0.631***
## (0.101) (0.101) (0.101) (0.100)
## pH 0.708*** 0.711*** 0.685***
## (0.105) (0.105) (0.105)
## citric.acid 0.039
## (0.096)
## free.sulfur.dioxide 0.004***
## (0.001)
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
## adj. R-squared 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
## sigma 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
## F 1142.0 581.1 432.6 437.5 350.4 294.3 253.9 228.1 209.6 188.6 191.3
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -5838.0 -5829.7 -5775.4 -5602.6 -5601.8 -5596.6 -5592.1 -5574.8 -5552.2 -5552.1 -5542.4
## Deviance 3112.3 3101.8 3033.7 2827.0 2826.0 2820.0 2814.8 2795.0 2769.3 2769.2 2758.2
## AIC 11682.0 11667.5 11560.9 11217.3 11217.6 11209.2 11202.2 11169.7 11126.4 11128.3 11108.8
## BIC 11701.5 11693.5 11593.3 11256.3 11263.1 11261.2 11260.7 11234.6 11197.9 11206.2 11186.8
## N 4896 4896 4896 4896 4896 4896 4896 4896 4896 4896 4896
## ==============================================================================================================================================================
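One way to read the table above is to follow the AIC as each variable is added to the nested sequence m1 to m10. The sketch below copies the AIC row and flags the steps where AIC goes up; this is one possible interpretation, not necessarily the criterion the analysis used.

```r
# AIC values for m1 to m10, copied from the table above.
aic <- c(m1 = 11682.0, m2 = 11667.5, m3 = 11560.9, m4 = 11217.3,
         m5 = 11217.6, m6 = 11209.2, m7 = 11202.2, m8 = 11169.7,
         m9 = 11126.4, m10 = 11128.3)

# Change in AIC when moving from each model to the next larger one.
aic_change <- diff(aic)

# Steps where the added variable made AIC worse (higher).
worse_steps <- names(aic_change)[aic_change > 0]
worse_steps
```

The flagged steps are m5 (adding chlorides) and m10 (adding citric.acid), the two terms whose coefficients are also not significant in the table.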
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(wine, alcohol <=
## quantile(wine$alcohol, 0.999)))
## m2: lm(formula = quality ~ alcohol + density, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity,
## data = subset(wine, alcohol <= quantile(wine$alcohol, 0.999)))
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## fixed.acidity, data = subset(wine, alcohol <= quantile(wine$alcohol,
## 0.999)))
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## fixed.acidity + sulphates, data = subset(wine, alcohol <=
## quantile(wine$alcohol, 0.999)))
## m7: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## fixed.acidity + sulphates + pH, data = subset(wine, alcohol <=
## quantile(wine$alcohol, 0.999)))
## m8: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## fixed.acidity + sulphates + pH + free.sulfur.dioxide, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
##
## ========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8
## ------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.510*** 90.296*** 74.262*** 52.915*** 81.496*** 157.578*** 154.204***
## (0.098) (6.169) (12.377) (11.979) (13.742) (14.452) (18.137) (18.106)
## alcohol 0.313*** 0.360*** 0.246*** 0.286*** 0.310*** 0.277*** 0.182*** 0.193***
## (0.009) (0.015) (0.018) (0.018) (0.019) (0.020) (0.024) (0.024)
## density 24.746*** -87.870*** -71.580*** -49.973*** -78.888*** -157.620*** -154.388***
## (6.083) (12.320) (11.925) (13.736) (14.463) (18.382) (18.350)
## residual.sugar 0.053*** 0.052*** 0.045*** 0.056*** 0.087*** 0.083***
## (0.005) (0.005) (0.005) (0.006) (0.007) (0.007)
## volatile.acidity -2.062*** -2.084*** -2.039*** -1.945*** -1.890***
## (0.109) (0.109) (0.109) (0.109) (0.110)
## fixed.acidity -0.047** -0.030* 0.065** 0.068***
## (0.015) (0.015) (0.020) (0.020)
## sulphates 0.620*** 0.660*** 0.628***
## (0.100) (0.100) (0.100)
## pH 0.713*** 0.695***
## (0.104) (0.103)
## free.sulfur.dioxide 0.003***
## (0.001)
## ------------------------------------------------------------------------------------------------------------------------
## R-squared 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3
## adj. R-squared 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.3
## sigma 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
## F 1142.0 581.1 432.6 437.5 352.6 302.5 268.5 239.1
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -5838.0 -5829.7 -5775.4 -5602.6 -5597.7 -5578.6 -5555.0 -5542.8
## Deviance 3112.3 3101.8 3033.7 2827.0 2821.2 2799.4 2772.5 2758.7
## AIC 11682.0 11667.5 11560.9 11217.3 11209.3 11173.3 11128.0 11105.6
## BIC 11701.5 11693.5 11593.3 11256.3 11254.8 11225.3 11186.5 11170.5
## N 4896 4896 4896 4896 4896 4896 4896 4896
## ========================================================================================================================
##
## Call:
## lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## fixed.acidity + sulphates + pH + free.sulfur.dioxide, data = subset(wine,
## alcohol <= quantile(wine$alcohol, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8240 -0.4942 -0.0403 0.4667 3.1206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.542e+02 1.811e+01 8.517 < 2e-16 ***
## alcohol 1.928e-01 2.411e-02 7.998 1.57e-15 ***
## density -1.544e+02 1.835e+01 -8.414 < 2e-16 ***
## residual.sugar 8.286e-02 7.289e-03 11.368 < 2e-16 ***
## volatile.acidity -1.890e+00 1.096e-01 -17.242 < 2e-16 ***
## fixed.acidity 6.827e-02 2.044e-02 3.340 0.000843 ***
## sulphates 6.276e-01 1.000e-01 6.273 3.84e-10 ***
## pH 6.948e-01 1.034e-01 6.720 2.02e-11 ***
## free.sulfur.dioxide 3.347e-03 6.767e-04 4.946 7.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7513 on 4887 degrees of freedom
## Multiple R-squared: 0.2813, Adjusted R-squared: 0.2801
## F-statistic: 239.1 on 8 and 4887 DF, p-value: < 2.2e-16
The variables in this linear model can account for 28.0% of the variance in the quality of white wines.
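To make the fitted model concrete, the sketch below applies its coefficients to the first wine shown in the `str()` output near the top of this report. The coefficients are the rounded values printed above (intercept and density are taken from the m8 column of the comparison table, which shows more digits), and the wine's values are rounded as printed, so the result is only approximate.

```r
# Rounded coefficients of the final model, copied from the output above.
coefs <- c(
  "(Intercept)"       = 154.204,
  alcohol             = 0.1928,
  density             = -154.388,
  residual.sugar      = 0.08286,
  volatile.acidity    = -1.890,
  fixed.acidity       = 0.06827,
  sulphates           = 0.6276,
  pH                  = 0.6948,
  free.sulfur.dioxide = 0.003347
)

# First wine in the dataset, values as printed by str(); observed quality is 6.
wine1 <- c(
  "(Intercept)"       = 1,
  alcohol             = 8.8,
  density             = 1.001,
  residual.sugar      = 20.7,
  volatile.acidity    = 0.27,
  fixed.acidity       = 7,
  sulphates           = 0.45,
  pH                  = 3,
  free.sulfur.dioxide = 45
)

# Linear predictor: sum of coefficient times value, matched by name.
predicted_quality <- sum(coefs * wine1[names(coefs)])
predicted_quality
```

The prediction lands between 5 and 6 for a wine that was scored 6. Because the intercept and the density coefficient are both large and of opposite sign, small rounding differences in density shift the result noticeably.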
In general, higher quality scores are associated with higher alcohol values, lower density values and lower residual.sugar values. However, when holding the density value constant, white wines with higher quality scores are most likely the ones with higher residual.sugar values. Since no single chemical property has a very strong relationship with the quality scores, I decided to start with a linear model that adds in all these variables, and then to see which ones play significant roles in the model.
The quality scores do not cluster by pH and fixed.acidity.
Yes, I created a linear model starting from the quality and alcohol. The alcohol can account for 18.9% of the variance in the quality of white wines. When adding in other chemical variables, the final model containing 8 chemical variables can account for 28.0% of the variance in the quality of white wines.
After omitting the top 0.1% values, the distribution of density appears to be normal, ranging from 0.987 to 1.002, with the peak around 0.992 to 0.995.
White wines with the highest quality scores have the highest alcohol level and the lowest density. The alcohol variance is larger in the wines scored 6, 7 and 8. The density variance is larger in the wines scored 5 and 6. The wines scored 3 or 4 do not differ noticeably in alcohol level or density.
When holding density value constant, white wines with higher residual sugar level more likely have higher quality scores. The plot indicates that a linear model might be built to predict the quality of white wine, if including density and residual sugar levels as predictor variables.
The white wine data set contains almost 5000 white wines across 12 variables. The quality evaluation score is the outcome variable, while the other 11 chemical property variables are treated as candidate predictor variables. In order to understand which chemical properties may influence the quality of white wines, I started by understanding the distribution of each variable, and then explored the relationships between pairs of variables of interest. Eventually, I built a linear model using 8 of the 11 chemical variables. This model can account for 28.0% of the variance in the quality of white wines.
There was no clear strong trend between quality and any single chemical variable. The highest correlation coefficient was 0.436, for alcohol. Therefore, it was hard to pick the feature(s) of interest at the beginning. I started from pair-wise plots, noticing that some chemical properties were inter-correlated, such as density, alcohol, residual sugar and total.sulfur.dioxide. I struggled to understand the relationship between quality and these inter-correlated chemical properties. After transforming quality into an ordered factor, the relationships were clearer on the multivariate plots.
My final linear model is only able to account for 28.0% of the variance in the quality of white wine. The predictive power of this model is weak. Given that the quality score can be treated as an ordered factor, ordinal regression or other predictive models, such as machine learning, may be a better option for this data set.
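As a pointer for that follow-up, the sketch below fits a proportional-odds ordinal regression with `polr()` from the MASS package on simulated data. Everything in it is made up for illustration (the data frame `toy`, the four quality labels, the effect size); it is not a result for the wine data.

```r
set.seed(1)
n <- 500

# Simulated predictor and a latent score that increases with it.
alcohol <- runif(n, min = 8, max = 14)
latent  <- 0.8 * alcohol + rlogis(n)

# Cut the latent score into four ordered quality levels of equal size.
quality <- cut(latent,
               breaks = quantile(latent, c(0, 0.25, 0.5, 0.75, 1)),
               labels = c("low", "mid", "high", "top"),
               include.lowest = TRUE, ordered_result = TRUE)
toy <- data.frame(alcohol = alcohol, quality = quality)

# Proportional-odds logistic regression; Hess = TRUE keeps the Hessian
# so that summary() can report standard errors.
fit <- MASS::polr(quality ~ alcohol, data = toy, Hess = TRUE)
coef(fit)
```

Unlike `lm()`, this treats the scores as ordered categories rather than as evenly spaced numbers, which is closer to how a tasting score behaves.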